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Abstract 

The classical mixture of Gaussians model is 
related to K-means via small-variance asymp- 
totic s\ as the covariances of the Gaussians tend 
to zero, the negative log-likelihood of the mix- 
ture of Gaussians model approaches the K-means 
objective, and the EM algorithm approaches the 
K-means algorithm. Kulis & Jordan (2012) used 
this observation to obtain a novel K-means-like 
algorithm from a Gibbs sampler for the Dirich- 
let process (DP) mixture. We instead consider 
applying small-variance asymptotics directly to 
the posterior in Bayesian nonparametric mod- 
els. This framework is independent of any spe- 
cific Bayesian inference algorithm, and it has 
the major advantage that it generalizes immedi- 
ately to a range of models beyond the DP mix- 
ture. To illustrate, we apply our framework to the 
feature learning setting, where the beta process 
and Indian buffet process provide an appropriate 
Bayesian nonparametric prior. We obtain a novel 
objective function that goes beyond clustering to 
learn (and penalize new) groupings for which we 
relax the mutual exclusivity and exhaustivity as- 
sumptions of clustering. We demonstrate several 
other algorithms, all of which are scalable and 
simple to implement. Empirical results demon- 
strate the benefits of the new framework. 



1. Introduction 

Clustering is a canonical learning problem and arguably the 
dominant application of unsupervised learning. Much of 
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the popularity of clustering revolves around the K-means 
algorithm; its simplicity and scalability make it the pre- 
ferred choice in many large-scale unsupervised learning 
problems — even though a wide variety of more flexible al- 
gorithms, including those from Bayesian nonparametrics, 
have been developed since the advent of K-means (Stein- 
ley, 2006; Jain, 2010). Indeed, Berkhin (2006) writes that 
K-means is "by far the most popular clustering tool used 
nowadays in scientific and industrial applications." 

K-means does have several known drawbacks. For one, 
the K-means algorithm clusters data into mutually exclu- 
sive and exhaustive clusters, which may not always be the 
optimal or desired form of latent structure for a data set. For 
example, pictures on a photo-sharing website might each be 
described by multiple tags, or social network users might 
be described by multiple interests. In these examples, di fea- 
ture allocation in which each data point can belong to any 
nonnegative integer number of groups — now called fea- 
tures — is a more appropriate description of the data (Grif- 
fiths & Ghahramani, 2006; Broderick et al., 2013). Second, 
the K-means algorithm requires advance knowledge of the 
number of clusters, which may be unknown in some ap- 
plications. A vast literature exists just on how to choose 
the number of clusters using heuristics or extensions of K- 
means (Steinley, 2006; Jain, 2010). A recent algorithm 
called DP-means (Kulis & Jordan, 2012) provides another 
perspective on the choice of cluster cardinality. Recall- 
ing the small- variance asymptotic argument that takes the 
EM algorithm for mixtures of Gaussians and yields the 
K-means algorithm, the authors apply this argument to a 
Gibbs sampler for a Dirichlet process (DP) mixture (An- 
toniak, 1974; Escobar, 1994; Escobar & West, 1995) and 
obtain a K-means-like algorithm that does not fix the num- 
ber of clusters upfront. 

Notably, the derivation of DP-means is specific to the 
choice of the sampling algorithm and is also not immedi- 
ately amenable to the feature learning setting. In this pa- 



per, we provide a more general perspective on these small- 
variance asymptotics. We show that one can obtain the ob- 
jective function for DP-means (independent of any algo- 
rithm) by applying asymptotics directly to the MAP esti- 
mation problem of a Gaussian mixture model with a Chi- 
nese Restaurant Process (CRP) prior (Blackwell & Mac- 
Queen, 1973; Aldous, 1985) on the cluster indicators. The 
key is to express the estimation problem in terms of the 
exchangeable partition probability function (EPPF) of the 
CRP (Pitman, 1995). 

A critical advantage of this more general view of small- 
variance asymptotics is that it provides a framework for 
extending beyond the DP mixture. The Bayesian non- 
parametric toolbox contains many models that may yield — 
via small-variance asymptotics — a range of new algorithms 
that to the best of our knowledge have not been discovered 
in the K-means literature. We thus view our major contri- 
bution as providing new directions for researchers working 
on K-means and related discrete optimization problems in 
machine learning. 

To highlight this generality, we show how the framework 
may be used in the feature learning setting. We take as 
our point of departure the beta process (BP) (Hjort, 1990; 
Thibaux & Jordan, 2007), which is the feature learning 
counterpart of the DP, and the Indian Buffet Process (IBP) 
(Griffiths & Ghahramani, 2006), which is the feature learn- 
ing counterpart of the CRP. We show how to express the 
corresponding MAP inference problem via an analogue of 
the EPPF that we refer to as an "exchangeable feature prob- 
ability function" (EFPF) (Broderick et al., 2013). Taking an 
asymptotic limit we obtain a novel objective function for 
feature learning, as well as a simple and scalable algorithm 
for learning features in a data set. The resulting algorithm, 
which we call BP-means, is similar to the DP-means algo- 
rithm, but allows each data point to be assigned to more 
than one feature. We also use our framework to derive sev- 
eral additional algorithms, including algorithms based on 
the Dirichlet-multinomial prior as well as extensions to the 
marginal MAP problem in which the cluster/feature means 
are integrated out. We compare our algorithms to existing 
Gibbs sampling methods as well as existing hard clustering 
methods in order to highlight the benefits of our approach. 

2. MAP Asymptotics for Clustering 

We begin with the problem setting of Kulis & Jordan 
(2012) but diverge in our treatment of the small-variance 
asymptotics. We consider a Bayesian nonparametric 
framework for generating data via a prior on clusterings 
and a likelihood that depends on the (random) clustering. 
Given prior and likelihood, we form a posterior distribution 
for the clustering. A point estimate of the clustering (i.e., 
a hard clustering) may be achieved by choosing a cluster- 



ing that maximizes the posterior; the result is a maximum a 
posteriori (MAP) estimate. 

Consider a data set Xi,...,XAr, where Xn is a D- 
component vector. Let denote the (random) number 
of clusters. Let Znk equal one if data index n belongs to 
cluster k and otherwise, so there is exactly one value of 
k for each n such that Znk = 1- We can order the cluster 
labels k so that the first are non-empty (i.e., Znk = 1 
for some n for each such k). Together and Zi.n^i.x+ 
describe a clustering. 

The Chinese restaurant process (CRP) (Blackwell & Mac- 
Queen, 1973; Aldous, 1985) yields a prior on and 
Zi:N,i:K+ ^s follows. Let 6> > bc a hyperparameter of 
the model. The first customer (data index 1) starts a new 
table in the restaurant; i.e., zi^i = 1. Recursively, the nth 
customer (data index n) sits at an existing table k with prob- 
ability in proportion to the number of people sitting there 
(i.e., in proportion to Sn-i,k '= Z]m=i ^^k) and at a new 
table with probability proportional to 0. 

Suppose the final restaurant has tables with N total 
customers sitting according to Zi.7v,i:k+- Then the proba- 
bility of this clustering is found by multiplying together the 
N steps in the recursion described above: 

a formula that is known as an exchangeable partition prob- 
ability function (EPPF) (Pitman, 1995). 

As for the likelihood, a common choice is to assume that 
data in cluster k are generated from a Gaussian with a 
cluster- specific mean /j^k and shared variance a'^Io (where 
Id is the identity matrix of size DxD and > 0), and we 
will make that assumption here. Suppose the /i^ are drawn 
iid Gaussian from a prior with mean in every dimen- 
sion and variance p'^Io for some hyperparameter > 0: 
= nf=i-^(M/c|0,p^/D). Then the likelihood of 
a data set x = xi:n given clustering z = Zi.n^i.x+ and 
means /i = is as follows: 

K+ 

F{x\z,fl) = l[ l[ M{XnW.(J^lD)- 

k=l n:Zn,k = ^ 

Finally, the posterior distribution over the clustering 
given the observed data, P(2:,/i|x), is calculated from 
the prior and likelihood using Bayes rule: F{z^fi\x) ex 
P(x|z,/i)P(/i)P(z). We find the MAP point estimate for 
the clustering and cluster means by maximizing the poste- 
rior: argmax^+^^ P(z, Note that the point estimate 
will be the same if we instead minimize the negative log 
joint likelihood: argmin^+^^ — logP(z, /i, x). 
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In general, calculating the posterior or MAP estimate is dif- 
ficult and usually requires approximation, e.g. via Markov 
Chain Monte Carlo or a variational method. A different 
approximation can be obtained by taking the limit of the 
objective function in this optimization as the cluster vari- 
ances decrease to zero: ^0. Since the prior allows 
an unbounded number of clusters, we expect that taking 
this limit alone will result in each data point being assigned 
to its own cluster. To arrive at a limiting objective func- 
tion that favors a non-trivial cluster assignment, we modu- 
late the number of clusters via the hyperparameter 6, which 
varies linearly with the expected number of clusters in the 
prior. In particular, we choose some constant > and 
let 

^ = exp(-AV(2a2)), 
so that, e.g., 6> ^ as ^ 0. 

Given this dependence of on and letting 0, we 

find that-2cr2 logPf ) satisfies 



J2 E \\Xn-^lkf + {K+-l)\' 



(2) 



k=l n:Zn 



where /(cr^) ~ g{o-'^) here denotes f{a'^)/g{a'^) ^ 1 as 
cr^ ^ 0. The double sum originates from the exponential 
function in the Gaussian data likelihood, and the penalty 
term — reminiscent of an AIC penalty (Akaike, 1974) — 
originates from the CRP prior (Sup. Mat. A). 

From Eq. (2), we see that finding the MAP estimate of the 
CRP Gaussian mixture model is asymptotically equivalent 
to the following optimization problem (Sup. Mat. A): 



argmm 



K+ 



K+ 



{K^ - 1)A^ (3) 



^'/^ k=l n:znk = l 



Kulis & Jordan (2012) derived a similar objective function, 
which they called the DP -means objective function (a name 
we retain for Eq. (3)), by first deriving a K-means- style al- 
gorithm from a DP Gibbs sampler. Here, by contrast, we 
have found this objective function directly from the MAP 
problem, with no reference to any particular inference algo- 
rithm and thereby demonstrating a more fundamental link 
between the MAP problem and Eq. (3). In the following, 
we show that this focus on limits of a MAP estimate can 
yield useful optimization problems in diverse domains. 

Notably, the objective in Eq. (3) takes the form of the K- 
means objective function (the double sum) plus a penalty 
of A^ for each cluster after the first; this offset penalty is 
natural since any partition of a non-empty set must have at 
least one cluster.^ Once we have Eq. (3), we may consider 



^The objective of Kulis & Jordan (2012) penalizes all 
clusters; the optimal inputs are the same in each case. 



efficient solution methods; one candidate is the DP-means 
algorithm of KuHs & Jordan (2012). 

3. MAP Asymptotics for Feature Allocations 

Once more consider a data set xi:n, where IS a D- 
component vector. Now let denote the (random) num- 
ber of features in our model. Let Znk equal one if data 
index n is associated with feature k and zero otherwise; in 
the feature case, while there must be a finite number of k 
values such that z^fe = 1 for any value of n, it is not re- 
quired (as in clustering) that there be exactly a single such 
k or even any such k. We order the feature labels k so that 
the first features are non-empty; i.e., we have z^k = 1 
for some n for each such k. Together and Zi.n^i.x+ 
describe a feature allocation. 

The Indian buffet process (IBP) (Griffiths & Ghahramani, 
2006) is a prior on Zi.n^i.x+ that places strictly positive 
probability on any finite, nonnegative value of K^. Like 
the CRP, it is based on an analogy between the customers 
in a restaurant and the data indices. In the IBP, the dishes 
in the buffet correspond to features. Let 7 > be a hy- 
perparameter of the model. The first customer (data index 
1) samples ~ Pois(7) dishes from the buffet. Recur- 
sively, when the nth customer (data index n) arrives at the 
buffet, Xlm=\ dishes have been sampled by the pre- 
vious customers. Suppose dish k of these dishes has been 
sampled Sn-i,k times by the first n — 1 customers. The nth 
customer samples dish k with probability Sn-i,k/^- The 
nth customer also samples ~ Pois(7/n) new dishes. 

Suppose the buffet has been visited by N customers who 
sampled a total of dishes. Let z = Zi.]y^i.x+ repre- 
sent the resulting feature allocation. Let H be the number 
of unique values of the zi:N,k vector across k; let Kh be 
the number of k with the hth unique value of this vector. 
We calculate an "exchangeable feature probability func- 
tion" (EFPF) (Broderick et al., 2013) by multiplying to- 
gether the probabilities from the N steps in the description 
and find that ¥{z) equals (Griffiths & Ghahramani, 2006) 



K+ 

7^ exp 



N 

^ SN,k 



(4) 



It remains to specify a probability for the observed data 
X given the latent feature allocation z. The linear Gaus- 
sian model of Griffiths & Ghahramani (2006) is a natu- 
ral extension of the Gaussian mixture model to the feature 
case. As previously, we specify a prior on feature means 

/i/e A/'(0, p'^Id) for some hyperparameter > 0. Now 
data point n is drawn independently with mean equal to the 

sum of its feature means, Yl^^i ZnkPk and variance a'^Io 
for some hyperparameter cr^ > 0. In the case where each 
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data point belongs to exactly one feature, this model is just 
a Gaussian mixture. We often write the means sls sl K x D 
matrix A whose kth row is /i^. Then, writing Z for the ma- 
trix with (n, /c) element Znk and X for the matrix with nth 
row Xn, we have P(X|Z, A) equal to 



1 



(27r(j2)^^/2 



exp 



tr((X - ZAy{X - ZA)) 
2^2 



(5) 



As in the clustering case, we wish to find the joint MAP 
estimate of the structural component Z and group- specific 
parameters A. It is equivalent to find the values of Z and A 
that minimize — logP(X, Z, A). Finally, we wish to take 
the limit of this objective as ^0. Lest every data point 
be assigned to its own separate feature, we modulate the 
number of features in the small-a^ limit by choosing some 
constant > and setting 7 = exp(— A^/ (2cr^)). 

Letting 0, we find that asymptotically (Sup. Mat. B) 

-2(7^ logP(X, Z, A) - tr[(X-ZA)'(X-ZA)]+K+A^ 

The trace originates from the matrix Gaussian, and the 
penalty term originates from the IBP prior. 

It follows that finding the MAP estimate for the feature 
learning problem is asymptotically equivalent to solving 
the following optimization problem: 

argmintr[(X - ZA)\X - ZA)] + i^+A^ (6) 

K+,Z,A 



We follow Kulis & Jordan (2012) in referring to the un- 
derlying random measure in denoting objective functions 
derived from Bayesian nonparametric priors. Recalling 
that the beta process (BP) (Hjort, 1990; Thibaux & Jordan, 
2007) is the random measure underlying the IBP, we refer 
to the objective function in Eq. (6) as the BP -means objec- 
tive. The trace term in this objective forms a K-means-style 
objective on a feature matrix Z and feature means A when 
the number of features (i.e., the number of columns of Z 
or rows of A) is fixed. The second term enacts a penalty of 
A^ for each feature. In contrast to the DP-means objective, 
even the first feature is penalized since it is theoretically 
possible to have zero features. 

BP-means algorithm. We formulate a BP-means algo- 
rithm to solve the optimization problem in Eq. (6) and dis- 
cuss its convergence properties. In the following, note that 
Z' Z is invertible so long as two features do not have the 
same collection of indices. In this case, we simply combine 
the two features into a single feature before performing the 
inversion. 



BP-means algorithm. Iterate the following two steps 
until no changes are made: 
1. For n = 1, . . . , 

• For k = 1, . . . , choose the optimizing value 
(0 or l)of Znk- 

• Let Z' equal Z but with one new feature (labeled 

+ 1) containing only data index n. Set A' = A 



but with one new row: A 



• If the triplet (i^+ + 1,Z\A') lowers the objec- 
tive from the triplet Z, A), replace the latter 
triplet with the former. 

2. Set A ^ {Z'Z)-^Z'X. 



Proposition 1. The BP-means algorithm converges after a 
finite number of iterations to a local minimum of the BP- 
means objective in Eq. (6). 

See Sup. Mat. F for the proof. Though the proposition guar- 
antees convergence, it does not guarantee convergence to 
the global optimum — an analogous result to those avail- 
able for the K-means and DP-means algorithms (Kulis & 
Jordan, 2012). Many authors have noted the problem of lo- 
cal optima in the clustering literature (Steinley, 2006; Jain, 
2010). One expects that the issue of local optima is only 
exacerbated in the feature domain, where the combinatorial 
landscape is much more complex. In clustering, this issue 
is often addressed by multiple random restarts and careful 
choice of cluster initialization; in Section 5 below, we also 
make use of random algorithm restarts and propose a fea- 
ture initialization akin to one with provable guarantees for 
K-means clustering (Arthur & Vassilvitskii, 2007). 

4. Extensions 

A number of variations on the Gaussian mixture poste- 
riors are of interest in both nonparametric and paramet- 
ric Bayesian inference. We briefly demonstrate that our 
methodology applies readily to several such situations. 

4.1. Collapsed objectives 

It is believed in many scenarios that collapsing out the clus- 
ter or feature means from a Bayesian model by calculat- 
ing instead the marginal structural posterior can improve 
MCMC sampler mixing (Liu, 1994). 

Clustering. In the clustering case, collapsing translates to 
forming the posterior P(z|a::) = J^¥{z^ iJ.\x). Note that 
even in the cluster case, we may use the matrix represen- 
tations Z, X, and A so long as we make the additional 

assumption that Ylk=i ^^k = 1 for each n. Finding the 
MAP estimate argmax^ P(Z|X) may, as usual, be accom- 
plished by minimizing the negative log joint distribution 
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with respect to Z. P(Z) is given by the CRP (Eq. (1)). 
P(X|Z) takes the form (Griffiths & Ghahramani, 2006): 



exp 



2^ 



(7) 



Using the same asymptotics in and Q as before, we find 
the hmiting optimization problem (Sup. Mat. C): 

argmin tr(X'(/Ar - Z{Z'ZY^Z')X) + (i^+ - 1) (8) 

The first term in this objective was previously proposed, 
via independent considerations, by Gordon & Henderson 
(1977). Simple algebraic manipulations allow us to rewrite 
the objective in a more intuitive format (Sup. Mat. C.l): 



argmin > > . 



r(^)||?. 



(i^+-l)A^ (9) 



where x^^^ \— Xlm-^^fc^i empirical 
cluster mean, i.e. the mean of all data points assigned to 
cluster k. This collapsed DP -means objective is just the 
original DP-means objective in Eq. (3) with the cluster 
means replaced by empirical cluster means. 

A corresponding optimization algorithm is as follows. 



Collapsed DP-means algorithm. Repeat the following 
step until no changes are made: 
1. For n = 1, . . . , 

• Assign Xn to the closest cluster if the contribution 
to the objective in Eq. (9) from the squared distance 
is at most . 

• Otherwise, form a new cluster with just x^- 



A similar proof to that of Kulis & Jordan (2012) shows that 
this algorithm converges in a finite number of iterations to 
a local minimum of the objective. 

Feature allocation. We have already noted that the like- 
lihood associated with the Gaussian mixture model condi- 
tioned on a clustering is just a special case of the linear 
Gaussian model conditioned on a feature matrix. There- 
fore, it is not surprising that Eq. (7) also describes P(X|Z) 
when Z is a feature matrix. Now, P(Z) is given by the IBP 
(Eq. (4)). Using the same asymptotics in and 7 as in the 
joint MAP case, the MAP problem for feature allocation Z 
asymptotically becomes (Sup. Mat. D): 



argmin tr(X'(/Ar 



Z{Z'Z)-^Z')X) + i^+A^ (10) 



The key difference with Eq. (8) is that here Z may have any 
finite number of ones in each row. We call the objective in 
Eq. (10) the collapsed BP -means objective. 



Just as the collapsed DP-means objective had an empiri- 
cal cluster means interpretation, so does the collapsed BP- 
means objective have an interpretation in which the feature 
means matrix A in the BP-means objective (Eq. (6)) is re- 
placed by its empirical estimate {Z' Z)~^Z. In particular, 
we can rewrite the objective in Eq. (10) as 

iY[{X-Z{Z'Z)-^Z'X)\X-Z{Z'Z)-^Z'X)]^K''\^. 

A corresponding optimization algorithm is as follows. 



Collapsed BP-means algorithm. Repeat the following 
step until no changes are made: 
1. Forn = 1,...,A^ 

• Choose Zn^i.K+ to minimize the objective in 
Eq. (10). Delete any redundant features. 

• Add a new feature (indexed + 1) with only data 
index n if doing so decreases the objective and if 
the feature would not be redundant. 



A similar proof to that of Proposition 1 shows that this 
algorithm converges in a finite number of iterations to a 
local minimum of the objective. 

4.2. Parametric objectives 

The generative models studied so far are nonparametric in 
the usual Bayesian sense; there is no a priori bound on 
the number of cluster or feature parameters. The objec- 
tives above are similarly nonparametric. Parametric mod- 
els, with a fixed bound on the number of clusters or fea- 
tures, are often useful as well, and we explore these here. 

First, consider a clustering prior with some fixed maximum 
number of clusters K. Let qi-^K represent a distribution 
over clusters. Suppose qi._K is drawn from a finite Dirich- 
let distribution with size K > 1 and parameter ^ > 0. 
Further, suppose the cluster for each data point is drawn iid 
according to qi.K- Then, integrating out q, the marginal 
distribution of the clustering is Dirichlet-multinomial: 



P(z) 



T{Ke) 

r(7V + KO) 



K 

n 



N,k ■ 



0) 



m 



(11) 



We again assume a Gaussian mixture likelihood, only now 
the number of cluster means /i/e has an upper bound of K. 

We can find the MAP estimate of z and /i under this model 
in the limit 0. With fixed, the clustering prior 

has no effect, and the resulting optimization problem is 
argmin^^^ Ef=i En:^^,=i ll^n - M/clP, which is just the 
usual K-means optimization problem. 

We can also try scaling = exp(— A^/(2cr^)) for some 
constant A^ > as in the unbounded cardinality case. Then 
taking the ^ limit of the log joint likelihood yields 
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a term of for each cluster containing at least one data 
index in the product in Eq. (11) — except for one such clus- 
ter. Call the number of such activated clusters . The 
resulting optimization problem is 

K 

argmin^ ^ -/i^f + (K AK+ - 1)A^ (12) 

This objective caps the number of clusters at K but contains 
a penalty for each new cluster up to iiT. 

A similar story holds in the feature case. Imagine that we 
have a fixed maximum of K features. In this finite case, 
we now let qi-K represent frequencies of each feature and 
let qk Beta(7, 1). We draw z^k ^ Bern(g'/c) iid across n 
and independently across k. The linear Gaussian likelihood 
model is as in Eq. (5) except that now the number of fea- 
tures is bounded. If we integrate out the qi:K, the resulting 
marginal prior on Z is 

^ f r{SN,k + 7)r(A^ - SN,k + 1) r(7 + 1) \ 

r(iv + 7 + i) r(7)r(i);- ^ 

Then the limiting MAP problem as ^ is 

argmintr[(X - ZAy{X - ZA)]. (14) 

Z,A 

This objective is analogous to the K-means objective but 
holds for the more general problem of feature allocations. 

Eq. (14) can be solved according to a K-means- style 
algorithm. Notably, in the following algorithm, all of the 
optimizations for n in step 1 may be performed in parallel. 

K-features algorithm. Repeat the following steps until 

no changes are made: 

1. For n = 1, . . . , 

• For /c = 1, i^T, set Zn,k to minimize — 

Zn,l:KA\\^. 

2. SetA = {Z'Z)-^Z'X. 

We can further set 7 = exp(— A^/(2cr^)) as for the un- 
bounded cardinality case before taking the limit ^ 0. 
Then a A^ term contributes to the limiting objective for each 
non-empty feature from the product in Eq. (13). The result- 
ing objective is 

argmintr[(X - ZAy{X - ZA)] + (K A K+)A^ (15) 

K+,Z,A 

reminiscent of the BP-means objective but with a cap of K 
possible features. 

5. Experiments 

We examine collections of unlabeled data to discover latent 
shared features. We have already seen the BP-means and 



collapsed BP-means algorithms for learning these features 
when the number of features is unknown. A third algo- 
rithm that we evaluate here involves running the K-features 
algorithm for different values of K and choosing the joint 
values of Z, A that minimize the BP-means objective in 
Eq. (6); we call this the stepwise K-features algorithm. If 
we assume the plot of the minimized K-features objective 
(Eq. (14)) as a function of K has increasing increments, 
then we need only run the K-features algorithm for increas- 
ing K until the objective decreases. 

It is well known that the K-means algorithm is sensitive to 
the choice of cluster initialization (Pena et al., 1999). Po- 
tential methods of addressing this issue include multiple 
random initializations and choosing initial, random cluster 
centers according to the K-means++ algorithm (Arthur & 
Vassilvitskii, 2007). In the style of K-means++, we intro- 
duce a similar feature means initialization. 

We first consider fixed K. In K-means++, the initial clus- 
ter center is chosen uniformly at random from the data set. 
However, we note that empirically, the various feature al- 
gorithms discussed tend to prefer the creation of a base fea- 
ture, shared amongst all the data. So start by assigning ev- 
ery data index to the first feature, and let the first feature 
mean be the mean of all the data points. Recursively, for 
feature k with k > 1, calculate the distance from each data 
point Xn,- to its feature representation Zn,.A for the con- 
struction thus far. Choose a data index n with probability 
proportional to this distance squared. Assign A^^. to be the 
nth distance. Assign Zm,k for all m = 1, . . . , A/" to opti- 
mize the K-features objective. In the case where K is not 
known in advance, we repeat the recursive step as long as 
doing so decreases the objective. 

Another important consideration in running these algo- 
rithms without a fixed number of clusters or features is 
choosing the relative penalty effect A^. One option is to 
solve for A^ from a proposed K value via a heuristic (Kulis 
& Jordan, 2012) or validation on a data subset. Rather than 
assume K and return to it in this roundabout way, in the fol- 
lowing we aim merely to demonstrate that there exist rea- 
sonable values of A^ that return meaningful results. More 
carefully examining the translation from a discrete (K) to 
continuous (A^) parameter space may be a promising direc- 
tion for future work. 

5.1. Tabletop data 

Using a LogiTech digital webcam, Griffiths & Ghahramani 
(2006) took 100 pictures of four objects (a prehistoric han- 
daxe, a Klein bottle, a cellular phone, and a $20 bill) placed 
on a tabletop. The images are in JPEG format with 240 
pixel height, 320 pixel width, and 3 color channels. Each 
object may or may not appear in a given picture; the ex- 
perimenters endeavored to place each object (by hand) in a 
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Figure 1 . Left. A comparison of results for the IBP Gibbs sampler 
(Griffiths & Ghahramani, 2006), the collapsed BP-means algo- 
rithm, the basic BP-means algorithm, and the stepwise K-features 
algorithm. The first column shows the time for each run of the 
algorithm in seconds; the second column shows the total running 
time of the algorithm (i.e., over multiple repeated runs for the fi- 
nal three); and the third column shows the final number of features 
learned (the IBP # is stable for > 900 final iterations). Right: A 
histogram of collections of the final K values found by the IBP 
for a variety of initializations and parameter starting values. 
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Figure 2. Upper row. Four example images in the tabletop data 
set. Second row: Feature assignments of each image. The first 
feature is the base feature, which depicts the Klein bottle and $20 
bill on a tabletop and is almost identical to the fourth picture in the 
first row. The remaining four features are shown in order in the 
third row. The fourth row indicates whether the picture is added 
or subtracted when the feature is present. 



respective fixed location across pictures. 

This setup lends itself naturally to the feature allocation do- 
main. We expect to find a base feature depicting the table- 
top and four more features, respectively corresponding to 
each of the four distinct objects. Conversely, clustering on 
this data set would yield either a cluster for each distinct 
feature combination — a much less parsimonious and less 
informative representation than the feature allocation — or 
some averages over feature combinations. The latter case 
again fails to capture the combinatorial nature of the data. 

We emphasize a further point about identifi ability within 
this combinatorial structure. One "true" feature allocation 
for this data is the one described above. But an equally 
valid allocation, from a combinatorial perspective, is one 
in which the base feature contains all four objects and the 
tabletop. Then there are four further features, each of which 
deletes an object and replaces it with tabletop; this allows 
every possible combination of objects on the tabletop to be 
constructed from the features. Indeed, any combination of 
objects on the tabletop could equally well serve as a base 
feature; the four remaining features serve to add or delete 
objects as necessary. 

We run PGA on the data and keep the first D = 100 prin- 
cipal components to form the data vector for each image. 
This pre-processing is the same as that performed by Grif- 
fiths & Ghahramani (2006), except the authors in that case 
first average the three color channels of the images. 

We consider the Gibbs sampling algorithm of Griffiths & 
Ghahramani (2006) with initialization (mass parameter 1 
and feature mean variance 0.5) and number of sampling 
steps (1000) determined by the authors; we explore alterna- 
tive initializations below. We compare to the three feature 
means algorithms described above — all with = 1. Each 
of the final three algorithms uses the appropriate variant 



of greedy initialization analogous to K-means++. We run 
1000 random initializations of the collapsed and BP-means 
algorithms to mitigate issues of local minima. We run 300 
random initializations of the K-features algorithm for each 
value of K and note that K = 2, . . . , 6 are (dynamically) 
explored by the algorithm. All code was run in Matlab on 
the same computer. Timing and feature count results are 
shown on the left of Figure 1 . 

While it is notoriously difficult to compare computation 
times for deterministic, hard- assignment algorithms such 
K-means to stochastic algorithms such as Gibbs sampling, 
particularly given the practical need for reinitialization to 
avoid local minima in the former, and difficult-to-assess 
convergence in the latter, it should be clear from the first 
column in the lefthand table of Figure 1 that there is a ma- 
jor difference in computation time between Gibbs sampling 
and the new algorithms. Indeed, even when the BP-means 
algorithm is run 1000 times in a reinitialization procedure, 
the total time consumed by the algorithm is still an order of 
magnitude less than that for a single run of Gibbs sampling. 
We note also that stepwise K-features is the fastest of the 
new algorithms. 

We further note that if we were to take advantage of par- 
allelism, additional drastic advantages could be obtained 
for the new algorithms. The Gibbs sampler requires each 
Gibbs iteration to be performed sequentially whereas the 
random initializations of the various feature means algo- 
rithms can be performed in parallel. A certain level of par- 
allelism may even be exploited for the steps within each it- 
eration of the collapsed and BP-means algorithms while the 
Zn,i:K optimizations of repeated feature K-means may all 
be performed in parallel across n (as in classic K-means). 

Another difficulty in comparing algorithms is that there is 
no clear single criterion with which to measure accuracy of 
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the final model in unsupervised learning problems such as 
these. We do note, however, that theoretical considerations 
suggest that the IBP is not designed to find either a fixed 
number of features as N varies nor roughly equal sizes in 
those features it does find (Broderick et al., 2012). This ob- 
servation may help explain the distribution of observed fea- 
ture counts over a variety of IBP runs with the given data. 
To obtain feature counts from the IBP, we tried running in a 
variety of different scenarios — combining different initial- 
izations (one shared feature, 5 random features, 10 random 
features, initialization with the BP-means result) and dif- 
ferent starting parameter values^ (mass parameter values 
ranging logarithmically from 0.01 to 1 and mean-noise pa- 
rameter values ranging logarithmically from 0.1 to 10). The 
final hundred K draws for each of these combinations are 
combined and summarized in a histogram on the right of 
Figure 1 . Feature counts lower than 7 were not obtained in 
our experiments, which suggests these values are, at least, 
difficult to obtain using the IBP with the given hyperpriors. 

On the other hand, the feature counts for the new K-means- 
style algorithms suggest parsimony is more easily achieved 
in this case. The lower picture and text rows of Figure 2 
show the features (after the base feature) found by feature 
K-means: as desired, there is one feature per tabletop ob- 
ject. The upper text row of Figure 2 shows the features 
to which each of the example images in the top row are 
assigned by the optimal feature allocation. For compari- 
son, the collapsed algorithm also finds an optimal feature 
encoding. The BP-means algorithm adds an extra, super- 
fluous feature containing both the Klein bottle and $20 bill. 

5.2. Faces data 

Next, we analyze the FEI face database, consisting of 400 
pictures of pre-aligned faces (Thomaz & Giraldi, 2010). 
200 different individuals are pictured, each with one smil- 
ing and one neutral expression. Each picture has height 300 
pixels, width 250 pixels, and one grayscale channel. Four 
example pictures appear in the first row of Figure 3. This 
time, we compare the repeated feature K-means algorithm 
to classic K-means. We keep the top 100 principal compo- 
nents to form the data vectors for both algorithms. 

With a choice of = 5, repeated feature K-means chooses 
one base feature (lefthand picture in the second row of Fig- 
ure 3) plus two additional features as optimal; the central 
and righthand pictures in the second row of Figure 3 depict 
the sum of the base feature plus the corresponding feature. 
The base feature is a generic face. The second feature codes 
for longer hair and a shorter chin. The third feature codes 
for darker skin and slightly different facial features. The 
feature combinations of each picture in the first row appear 

^We found convergence failed for some parameter initializa- 
tions outside this range 
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Figure 3. First row: Four sample faces. Second row: The base 
feature (left) and other 2 features returned by repeated feature K- 
means with = 5. The final pictures are the cluster means from 
K-means with K — {third row) and K — A {fourth row). The 
righthand text shows how the sample pictures (left to right) are 
assigned to features and clusters by each algorithm. 



in the first text row on the right; all four possible combina- 
tions are represented. 

K-means with 2 clusters and feature K-means with 2 fea- 
tures both encode exactly 2 distinct, disjoint groups. For 
larger numbers of groups though, the two representations 
diverge. For instance, consider a 3-cluster model of the 
face data, which has the same number of parameters as the 
3 -feature model. The resulting cluster means appear in the 
third row of Figure 3. While the cluster means appear sim- 
ilar to the feature means, the assignment of faces to clus- 
ters is quite different. The second righthand text row in 
Figure 3 shows to which cluster each of the four first-row 
faces is assigned. The feature allocation of the fourth pic- 
ture in the top row tells us that the subject has long hair 
and certain facial features, roughly, whereas the clustering 
tells us that the subject's hair is more dominant than facial 
structure in determining grouping. Globally, the counts of 
faces for clusters (1,2,3) are (154,151,95) while the counts 
of faces for feature combinations (100,110,101,111) are 
(139,106,80,75). 

We might also consider a clustering of size 4 since there are 
4 groups specified by the 3-feature model. The resulting 
cluster means are in the bottom row of Figure 3, and the 
cluster assignments of the sample pictures are in the bot- 
tom, righthand text row. None of the sample pictures falls 
in cluster 4. Again, the groupings provided by the feature 
allocation and the clustering are quite different. Notably, 
the clustering has divided up the pictures with shorter hair 
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into 3 separate clusters. In this case, the counts of faces 
for clusters (1,2,3,4) are (121,150,74,55). The feature allo- 
cation here seems to provide a sparser representation and 
more interpretable groupings relative to both cluster cardi- 
nalities. 

6. Conclusions 

We have developed a general methodology for obtaining 
hard-assignment objective functions from Bayesian MAP 
problems. The key idea is to include the structural variables 
explicitly in the posterior using combinatorial functions 
such as the EPPF and the EFPF. We apply this methodology 
to a number of generative models for unsupervised learn- 
ing, with particular emphasis on latent feature models. We 
show that the resulting algorithms are capable of modeling 
latent structure out of reach of clustering algorithms but 
are also much faster than existing feature allocation learn- 
ers studied in Bayesian nonparametrics. We have devoted 
some effort to finding algorithmic optimizations in the style 
of K-means (e.g., extending K-means++ initializations) in 
this domain. Nonetheless, the large literature on optimal 
initialization and fast, distributed running of the K-means 
algorithm suggests that, with some thought, the algorithms 
presented here can still be much improved in future work. 
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Supplementary Material 

A. DP-means objective derivation 

First consider the generative model in Section 2. The joint 
distribution of the observed data x, cluster indicators z, and 
cluster means ji can be written as follows. 

P(x, z, ii) = P(x|z, /i)P(z)P(/i) 

= n n j^i^nW^cy'^in) 

k=l n:Zn,k = '^ 



K+ 

H^iflklO^p^lD) 



k=l 



Then set := exp(— A^/(2cr^)) and consider the limit 



0. In the following, f{a^ 



0{g{a^)) de- 



notes that there exist some constants c, 5^ > such that 

\f{a^)\ < c\g{a^)\ for M <s\ 



-logP(x,2;,/i) 

K+ 

= E E 

k=l n:Zn,k = '^ 
A2 



■{K+-1] 
0(1) 



0(1) 



2(t2 



\Xn - Mfel 



It follows that 



K+ 

-2aHogF{x,z,n) = Y, E W^n -Hkf 

k=l n:Zn,k = ^ 

+ (i^+-l)A2 + 0(a2 log(a2)). 

But since cr^ log(cr^) ^ as ^ 0, we have that the re- 
mainder of the righthand side is asymptotically equivalent 
(as cr^ 0) to the lefthand side (Eq. (2)). 



B. BP-means objective derivation 

The recipe is the same as in Sup. Mat. A. This time we 
start with the generative model in Section 3. The joint dis- 
tribution of the observed data X, feature indicators Z, and 
feature means A can be written as follows. 

P(X, Z, A) = P(X|Z, A)F{Z)F{A) 
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exp{-EliS} ^ (Sj,^,-l)\{N-SN,ky. 



n 



AT! 



tr X'{In - Z{Z'Z + —Id)-^Z')X 



Now set 7 := exp(— A^/(2cr^)) and consider the limit 
^ 0. Then 

-logF{X,Z,A) 

= 0(loga2) + -i^tr((X - - ZA)) 



A2 



+ ^ + exp(-AV(2cT2)) ^ + 0(1) 

n=l 

+ 0(1) 
It follows that 

-2(7^ logP(X, Z, A) = tr[(X - ZA)'(X - ZA)] + i^+A^ 

+ O {a^ exp(-AV(2a2))) + 0(^' ^og{a^)). 

But since exp(-A^/(2cr^)) ^ and cr^log(cr^) ^ as 
cr^ 0, we have that -2crMogP(X, Z, A) - tr[(X - 
ZA)'(X-ZA)] +K+A2. 

C. Collapsed DP-means objective derivation 

We apply the usual recipe as in Sup. Mat. A. The generative 
model for collapsed DP-means is described in Section 4.1. 
The joint distribution of the observed data X and cluster 
indicators Z can be written as follows. 

P(X, Z) =F{X\Z)F{Z) 



+ (K+-l)A2 + 0(a2 \og{a^)) 

We note that cr^log(cr^) ^ as ^0. Further note 
that Z^Z is a diagonal K x K matrix with (k^k) entry 
(call it SN,k) equal to the number of indices in cluster k. 
Z'Z is invertible since we assume no empty clusters are 
represented in Z. Then 

-2a^ logP(X, Z) 

- tr (X'(/^ - Z{Z'Z)-^Z')X) + - 1)A2 
as cr^ 0. 

C.l. More interpretable objective 

The objective for the collapsed Dirichlet process is more 
interpretable after some algebraic manipulation. We de- 
scribe here how the opaque tr (X'(/Ar - Z{Z' Z)'^ Z')X) 
term^can be written in a form more reminiscent of the 

ZlfeLi ^n-.zr^ k=i 11^^ ~ '^^ll^ uncollapsed ob- 

jective. First, recall that C := Z^ Z is sl K x K matrix 
with Ck,k = SN,k and Cj^k = for j / k. Then := 
Z{Z'Z)-^Z' is an X X X matrix with C; = S'^ if 
and only if Zn,k = Zm,k = 1 and ^ = if Zn,k + ^m.k- 

tr{X\lN-Z{Z'Z)-^Z')X) 
= tr(X'X) - tr(X'Z(Z'Z)-^Z'X) 

D K+ 

= tr(XX') - X] Yl SN\^n,d^m,d 



d=l k=l n:Zn fc = l rniZm fc = l 



■ exp |-^tr (^X'{Im - Z{Z'Z + ^Id)-'Z')X^ | 



Now set 9 := exp(— A^/(2cr^)) and consider the limit 
0-2 -> 0. Then 

-\ogV{X,Z) = 0{\og{a^)) 

+ ^tr (^X'(J^ - ZiZ'Z + '^Id)-^Z')X^ 

+ (i^+-l)^+0(l) 

It follows that 

-2(7^ logF(X, Z) 



K+ 

E 

/c=l 



E E II- 
E E II- 



for cluster- specific empirical mean x^^^ := 

D. Collapsed BP-means objective derivation 

We continue to apply the usual recipe as in Sup. Mat. A. 
The generative model for collapsed BP-means is described 
in Section 4.1. The joint distribution of the observed data 
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X and feature indicators Z can be written as follows. 

W{X, Z) = V{X\Z)V{Z) 



K + 

.1 =n n •^(x„iMfe,<T'Sfc) 



7^%xp{-Eti^} ^ ^SN,k-mN-SN,k)\ 



k—1 n:Zn,k — '^ 
k=l 



2'^'D/2r^(j,/2)' 



n 

/c=l 



.exp|-itr($S^i)| 



Now set 7 := exp(— A^/(2cr^)) and consider the limit 
0-2 -> 0. Then 

-logP(X,Z)=0(log(cT2)) 

+ ^tr - + ^Id)-'Z')X^ 



where F/:) is the multivariate gamma function. Consider 
the Hmit 0. Set u = A^/a^ for some constant : 

A^ > 0. Then 



AT 



It follows that 



^ + exp(-AV(2a^)) ^ + 0(1) 



-logP(x,Z,/i,cr^S) 

= E E 

k=l n:Zn,k = '^ 
K 

+ 0(1) + ^ 



k=l ^ 



^A^log|*| + ^A^log2 



+ logrz,(A7(2<T^))+(^ + 



A2 D + 1 



log|Efc|+0(l) 



-2(7^ logP(X, Z)=tr{ X'{Im - Z{Z'Z + ^Id)-^Z')X 



So we find 



+ K+A^ + O (a^ exp(-AV(2a2))) + 0{cj^ \og{cj^)). 

Butexp(-AV(2cr2)) ^ Oanda^ log(o-2) ^ Oasa^ ^ 0. 
And Z' Z is invertible so long as two features do not have 
identical membership (in which case we collect them into 
a single feature). So we have that — 2cr^ logP(X, Z) ~ 
tr [X'{In - Z{Z'Z)-^Z')X) + K+\^. 

E. General multivariate Gaussian likelihood 

Above, we assumed a multivariate spherical Gaussian like- 
lihood for each cluster. This assumption can be general- 
ized in a number of ways. For instance, assume a general 
CO variance matrix a^E/e for positive scalar and positive 
definite D x D matrix E^. Then we assume the follow- 
ing likelihood model for data points assigned to the kth 
cluster (Zn,/c = 1): Xn ~ JV{fik^ cr'^^k)- Moreover, as- 
sume an inverse Wishart prior on the positive definite ma- 
trix E/c! E/c ^ VF(<I>~^, v) for <l> a positive definite matrix 
and u > D — 1. Assume a prior P(/i) on /i that puts strictly 
positive density on all real- valued I^-length vectors /i. For 
now we assume K is fixed and that F{z) puts a prior that 
has strictly positive density on all valid clusterings of the 
data points. This analysis can be immediately extended to 
the varying cluster number case via the reasoning above. 
Then 

P(x,Z,/i,Cr^E) 



-2a^ [log P(x, z, /i, a^E) + log rn{XV{2a^))] 

K 

^ {Xn- l^kY^k^iXn- l^k) 



k=l n:Zn,k = '^ 
K 



^A2log|Efe|+0(<72). 



k = l 



Letting ^ 0, the righthand side becomes 



K 



K 



^ ^ (Xn-/ife)'^feH^n-Mfe) + ^A^log|E/e|, 

k-=ln:Zn,k — '^ k=l 

the final form of the objective. 

If the E/c are known, they may be inputted and the objec- 
tive may be optimized over the cluster means and cluster 
assignments. In general, though, the resulting optimization 
problem is 



mm 

Z,fI,Ti 



K 

E 

k=i 



^ {Xn - llk)'^k^{Xn - /J^k) + log |E/e| 



That is, the squared Euclidean distance in the classic K- 
means objective function has been replaced with a Maha- 
lanobis distance, and we have added a penalty term on the 
size of the E/^ matrices (with A^ modulating the penalty as 
in previous examples). This objective is reminiscent of that 
proposed by Sung & Poggio (1998). 



12 



F. Proof of BP-means local convergence 

The proof of Proposition 1 is as follows. 

Proof. By construction, the first step in any iteration does 
not increase the objective. The second step starts by delet- 
ing any features that have the same index collection as 
an existing feature. Suppose there are m such features 
with indices J and we keep feature k. By setting Ak^. ^ 
^j^j Aj^., the objective is unchanged. Next, note 

VAtr[(X - ZAy{X - ZA)] = 2Z\X - ZA). (16) 

Setting the gradient to zero, we find that A = 
{Z' Z)~^ Z' X solves the equation for A and therefore min- 
imizes the objective with respect to A when Z' Z is invert- 
ible, as we have already guaranteed. 

Finally, since there is only a finite number of feature allo- 
cations in which each data point has at most one feature 
unique to only that data point and no features containing 
identical indices (any extra such features would only in- 
crease the objective due to the penalty), the algorithm can- 
not visit more than this many configurations and must finish 
in a finite number of iterations. □ 
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